Lab 02

Learning objectives

By the end of the lab, you will be able to …

  • implement basic variable manipulation
  • create useful frequency tables
  • produce measures of central tendency and variability

Code-along 02

Download and open code-along-02.qmd

Intro. to data

Packages

Load the standard packages.

library(here)
library(tidyverse)
library(gssr)
library(gssrdoc)


Install and load the summarytools package.

install.packages("summarytools")


library(summarytools)

Load your data & codebook

# Get the data only for the 2024 survey respondents
gss24 <- gss_get_yr(2024)

# Load the codebook
data(gss_dict)

Operators in R

Operators in R are symbols that direct R to perform mathematical, logical, and decision operations. A few key ones to know before we get started:

To test equality or inequality:
==, !=, >, >=, <, <=

To indicate “and”, “or”, and “not”:
& | !

To assign values to data objects:
<-, ->, =
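Here is a small sketch of these operators in action. The vector `hours` and the result names are made up for illustration; they are not part of the GSS data.

```r
# Hypothetical toy vector of weekly hours, with one missing value
hours <- c(20, 40, 45, 60, NA)

# Equality and inequality tests return TRUE/FALSE (or NA when the value is missing)
is_forty  <- hours == 40
not_forty <- hours != 40

# Combine tests with & ("and") and | ("or")
in_range <- hours >= 40 & hours <= 50
extreme  <- hours < 30 | hours > 50

# ! means "not": here, "not missing"
observed <- !is.na(hours)

# Three ways to assign; <- is the most common style in R code
a <- 5
5 -> b
c_val = 5
```

Notice that a comparison involving `NA` is itself `NA`, which is why we use `is.na()` rather than `== NA` to find missing values.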

function(argument)

Functions are (most often) verbs, followed by what they will be applied to in parentheses:


do_this(to_this)
do_that(to_this, to_that, with_those)
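As a concrete sketch of the pattern above, here are two base R functions applied to a made-up vector (`ages` is hypothetical). The first takes one argument; the second takes an argument plus a named option.

```r
# Hypothetical toy vector
ages <- c(23, 35, 47, 51)

# do_this(to_this): one function, one argument
avg_age <- mean(ages)

# do_that(to_this, with_those): extra arguments change how the function behaves
avg_age_rounded <- round(avg_age, digits = 1)

avg_age
```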

Variables

Remember, you can access the variables (i.e., columns) using the $ operator, as shown using the table() function.


The variable names are case sensitive. In this dataset, all variables are lowercase.

table(gss24$fefam)

  1   2   3   4 
167 492 899 604 

492 respondents were coded as 2 on this variable. What does that mean?
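To practice the `$` operator without loading the GSS, here is a minimal sketch on a made-up data frame. The data frame `toy` and its values are invented; only the column name mirrors the lab.

```r
# Hypothetical toy data frame with one lowercase column
toy <- data.frame(fefam = c(1, 2, 2, 3, 4, 3, 2))

# Access the column with $ and count each value
counts <- table(toy$fefam)
counts

# Names are case sensitive: the uppercase spelling does not exist
lower_exists <- "fefam" %in% names(toy)
upper_exists <- "FEFAM" %in% names(toy)
```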

dplyr grammar

What’s the advantage of dplyr grammar? We can sequence data manipulation!

gss24 |> 
  filter(!is.na(sex))  |>
  group_by(sex) |>
  descr(hrs1,
        stats = "common") |>
  tb() 
# A tibble: 2 × 10
  sex        variable  mean    sd   min   med   max n.valid     n pct.valid
  <dbl+lbl>  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>     <dbl>
1 1 [male]   sex       41.7  13.7     0    40    89     869  1467      59.2
2 2 [female] sex       37.3  13.7     0    40    89     891  1823      48.9
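To see what "sequencing" buys us, compare a nested call with the same steps written as a pipeline. This is a minimal base R sketch on a made-up vector (`values` is hypothetical), and it assumes R 4.1 or later for the native pipe.

```r
# Hypothetical toy vector
values <- c(4, 9, 16)

# Nested: read from the inside out
nested <- sum(sqrt(values))

# Piped: read top to bottom, one step per line
piped <- values |>
  sqrt() |>
  sum()

identical(nested, piped)
```

Both versions do the same work; the piped version just lists the steps in the order they happen.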

variable types

pipes

https://sta199-f24.github.io/slides/03-grammar-of-data-transformation-slides.html#/the-pipe

Frequency Distributions

Variable Descriptions

Let’s familiarize ourselves with the premarsx and polviews variables.


In the console, type ?premarsx and hit enter. The Help pane will show you the question text, response options and values.


Now, do the same for polviews.

Table

Run this code to see the frequency table for the premarsx variable. Then, add a line below to also see a table for the polviews variable.

table(gss24$premarsx)

   1    2    3    4 
 357  122  258 1378 


table(gss24$polviews)

   1    2    3    4    5    6    7 
 140  421  368 1148  381  516  186 

Cross-tabs

The table() function also lets you create a table with two variables.

# 1st variable is the rows, 2nd variable is the columns.
table(gss24$premarsx, gss24$polviews)
   
      1   2   3   4   5   6   7
  1   8  11  13  78  45 132  52
  2   1  10  10  44  18  25   8
  3   3  26  29  91  41  43  16
  4  91 227 187 488 148 145  38
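A tiny sketch of the same idea with made-up data, so you can check each cell by eye. The vectors `attitude` and `ideology` are hypothetical.

```r
# Hypothetical toy responses for five people
attitude <- c("wrong", "wrong", "not wrong", "not wrong", "not wrong")
ideology <- c("Liberal", "Conservative", "Liberal", "Liberal", "Conservative")

# First variable becomes the rows, second becomes the columns
tab <- table(attitude, ideology)
tab

# Pull out one cell by row name and column name
tab["not wrong", "Liberal"]
```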

Labels

Use haven::as_factor to see the value labels instead of the value numbers. Then, do the same for polviews.

table(as_factor(gss24$premarsx))

                 always wrong           almost always wrong 
                          357                           122 
         wrong only sometimes              not wrong at all 
                          258                          1378 
                        other                           iap 
                            0                          1126 
                   don't know            I don't have a job 
                           50                             0 
                  dk, na, iap                     no answer 
                            0                             6 
                not imputable                       refused 
                            0                             0 
               skipped on web                    uncodeable 
                           12                             0 
not available in this release    not available in this year 
                            0                             0 
                 see codebook 
                            0 

Labels

table(as_factor(gss24$polviews))

            extremely liberal                       liberal 
                          140                           421 
             slightly liberal  moderate, middle of the road 
                          368                          1148 
        slightly conservative                  conservative 
                          381                           516 
       extremely conservative                    don't know 
                          186                            99 
                          iap            I don't have a job 
                            0                             0 
                  dk, na, iap                     no answer 
                            0                            20 
                not imputable                       refused 
                            0                             0 
               skipped on web                    uncodeable 
                           30                             0 
not available in this release    not available in this year 
                            0                             0 
                 see codebook 
                            0 
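To see what as_factor() is doing, here is a minimal sketch that builds a small labelled variable by hand with haven::labelled(). The vector `toy` and its two labels are made up for illustration; the real GSS variables carry many more labels.

```r
library(haven)

# Hypothetical toy variable: numeric codes with value labels attached
toy <- labelled(
  c(1, 2, 1, 2, 2),
  labels = c("always wrong" = 1, "not wrong at all" = 2)
)

# as_factor() swaps the numeric codes for their labels
toy_fct <- as_factor(toy)
table(toy_fct)

# levels = "both" keeps the code and the label together
table(as_factor(toy, levels = "both"))
```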

Better Labels

Let’s clean up the levels for premarsx.

# Get rid of all the "missing" levels (they become regular NA)
gss24$premarsx <- zap_missing(gss24$premarsx)
# Apply the labels instead of the numeric values
gss24$premarsx <- as_factor(gss24$premarsx)
table(gss24$premarsx)

        always wrong  almost always wrong wrong only sometimes 
                 357                  122                  258 
    not wrong at all                other 
                1378                    0 

Better Labels cont.

Let’s get rid of the empty levels in premarsx.

gss24$premarsx <- droplevels(gss24$premarsx)
table(gss24$premarsx)

        always wrong  almost always wrong wrong only sometimes 
                 357                  122                  258 
    not wrong at all 
                1378 
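Here is a minimal base R sketch of what droplevels() does, using a made-up factor (`f` is hypothetical) that has one level nobody chose.

```r
# Hypothetical factor with an unused level, "other"
f <- factor(c("yes", "no", "yes"), levels = c("yes", "no", "other"))

# The empty level still shows up in the table with a count of 0
table(f)

# droplevels() removes levels that have no observations
f_clean <- droplevels(f)
table(f_clean)
```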

Manipulating Variables

For polviews, let’s combine categories to ease interpretation. This is easiest when the levels are numeric.

Let’s remind ourselves which value corresponds to each label.

table(as_factor(gss24$polviews, levels = "both")) # both shows value and label

             [1] extremely liberal                        [2] liberal 
                               140                                421 
              [3] slightly liberal   [4] moderate, middle of the road 
                               368                               1148 
         [5] slightly conservative                   [6] conservative 
                               381                                516 
        [7] extremely conservative                    [NA] don't know 
                               186                                 99 
                          [NA] iap            [NA] I don't have a job 
                                 0                                  0 
                  [NA] dk, na, iap                     [NA] no answer 
                                 0                                 20 
                [NA] not imputable                       [NA] refused 
                                 0                                  0 
               [NA] skipped on web                    [NA] uncodeable 
                                30                                  0 
[NA] not available in this release    [NA] not available in this year 
                                 0                                  0 
                 [NA] see codebook 
                                 0 

Manipulating Variables

# Save over the dataset with an added variable
gss24 <- gss24 |>
  # Create a new variable by assigning labels based on values of polviews
  mutate(pol3cat = case_when(
    polviews >= 1 & polviews <= 3 ~ "Liberal",
    polviews == 4 ~ "Moderate",
    polviews >= 5 & polviews <= 7 ~ "Conservative",
    # Set everything else to "missing"
    TRUE ~ NA_character_),
    # Convert the variable to a factor with a specified level order
    pol3cat = factor(pol3cat,
                     levels = c("Liberal", "Moderate", "Conservative"))
  )


The pipe can be written as |> (built into R 4.1 and later) or %>% (from magrittr, loaded with the tidyverse).
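To watch each rule fire, including what happens to a missing value, here is the same recode on a made-up five-row data frame. `toy` is hypothetical; the rules are copied from the lab code.

```r
library(dplyr)

# Hypothetical toy data: one value per branch, plus a missing value
toy <- data.frame(polviews = c(1, 4, 7, NA, 3))

toy <- toy |>
  mutate(
    pol3cat = case_when(
      polviews >= 1 & polviews <= 3 ~ "Liberal",
      polviews == 4 ~ "Moderate",
      polviews >= 5 & polviews <= 7 ~ "Conservative",
      TRUE ~ NA_character_   # anything unmatched, including NA, stays missing
    ),
    pol3cat = factor(pol3cat, levels = c("Liberal", "Moderate", "Conservative"))
  )

toy
```

The rules are checked from top to bottom, and a row takes the label from the first rule it satisfies.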

Frequency Table

Always double check your work.


table(gss24$polviews, gss24$pol3cat)
   
    Liberal Moderate Conservative
  1     140        0            0
  2     421        0            0
  3     368        0            0
  4       0     1148            0
  5       0        0          381
  6       0        0          516
  7       0        0          186
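By default table() leaves missing values out, so the check above cannot tell you where the NAs went. A minimal sketch of one way to include them, using the base R `useNA` argument on made-up vectors (`polviews` and `pol3cat` here are hypothetical stand-ins, not the GSS columns):

```r
# Hypothetical toy vectors: an original variable and its recode
polviews <- c(1, 4, 7, NA, 3, 6)
pol3cat <- factor(
  c("Liberal", "Moderate", "Conservative", NA, "Liberal", "Conservative"),
  levels = c("Liberal", "Moderate", "Conservative")
)

# Default: rows with a missing value are silently dropped
check_default <- table(polviews, pol3cat)

# useNA = "ifany" adds an NA row and column when missing values exist
check_with_na <- table(polviews, pol3cat, useNA = "ifany")
check_with_na
```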

Relative Frequency Table

Make a frequency table. One of summarytools’ main purposes is to help clean and prepare data for further analysis. Pay attention to the missing values. Then, do the same for premarsx.


freq(gss24$pol3cat) 
Frequencies  
gss24$pol3cat  
Type: Factor  

                     Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
------------------ ------ --------- -------------- --------- --------------
           Liberal    929     29.40          29.40     28.07          28.07
          Moderate   1148     36.33          65.73     34.69          62.77
      Conservative   1083     34.27         100.00     32.73          95.50
              <NA>    149                               4.50         100.00
             Total   3309    100.00         100.00    100.00         100.00
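The difference between "% Valid" and "% Total" is only the denominator. This base R sketch recomputes both from the counts in the table above; it is a hand calculation for illustration, not how summarytools works internally.

```r
# Counts copied from the freq() output above
counts <- c(Liberal = 929, Moderate = 1148, Conservative = 1083)
n_missing <- 149

# % Valid: divide by the number of non-missing responses
pct_valid <- round(100 * counts / sum(counts), 2)

# % Total: divide by everyone, including the missing
pct_total <- round(100 * counts / (sum(counts) + n_missing), 2)

# Cumulative % Valid: running total of the valid percentages
pct_valid_cum <- round(100 * cumsum(counts) / sum(counts), 2)

rbind(pct_valid, pct_valid_cum, pct_total)
```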

Relative Frequency Table

freq(gss24$premarsx) 
Frequencies  
gss24$premarsx  
Type: Factor  

                             Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
-------------------------- ------ --------- -------------- --------- --------------
              always wrong    357     16.88          16.88     10.79          10.79
       almost always wrong    122      5.77          22.65      3.69          14.48
      wrong only sometimes    258     12.20          34.85      7.80          22.27
          not wrong at all   1378     65.15         100.00     41.64          63.92
                      <NA>   1194                              36.08         100.00
                     Total   3309    100.00         100.00    100.00         100.00

Pretty Tables

Using report.nas = FALSE suppresses the missing data.
The headings = FALSE parameter suppresses the heading section. Do the same for premarsx.


freq(gss24$pol3cat, report.nas = FALSE, headings = FALSE) 

                     Freq        %   % Cum.
------------------ ------ -------- --------
           Liberal    929    29.40    29.40
          Moderate   1148    36.33    65.73
      Conservative   1083    34.27   100.00
             Total   3160   100.00   100.00

Pretty Tables

freq(gss24$premarsx, report.nas = FALSE, headings = FALSE) 

                             Freq        %   % Cum.
-------------------------- ------ -------- --------
              always wrong    357    16.88    16.88
       almost always wrong    122     5.77    22.65
      wrong only sometimes    258    12.20    34.85
          not wrong at all   1378    65.15   100.00
                     Total   2115   100.00   100.00

Cross-tab

The table() function gives us the frequencies.


table(gss24$premarsx, gss24$pol3cat)
                      
                       Liberal Moderate Conservative
  always wrong              32       78          229
  almost always wrong       21       44           51
  wrong only sometimes      58       91          100
  not wrong at all         505      488          331


We want to add the column percentages…

Relative Frequency Cross-tab

# Change from table() to ctable()
ctable(gss24$premarsx, gss24$pol3cat,
       prop = "c",     # "c" gives column %; "r" would give row %
       format = "p",   # adds the % symbols to the table
       useNA = "no")   # exclude the missing levels from the table
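If you want to check column percentages by hand, base R's prop.table() does the same arithmetic. This sketch rebuilds the cross-tab from the counts shown on the previous slide; it illustrates the calculation and is not the output of ctable().

```r
# Counts copied from the table() cross-tab above
tab <- matrix(
  c( 32,  78, 229,
     21,  44,  51,
     58,  91, 100,
    505, 488, 331),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    premarsx = c("always wrong", "almost always wrong",
                 "wrong only sometimes", "not wrong at all"),
    pol3cat  = c("Liberal", "Moderate", "Conservative")
  )
)

# margin = 2 divides each cell by its column total (margin = 1 would use rows)
col_pct <- 100 * prop.table(tab, margin = 2)
round(col_pct, 1)
```

Each column now sums to 100, so you can compare how liberals, moderates, and conservatives are distributed across the response categories.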

Central Tendency & Variability

Mode

Remember, the mode is the category with the greatest frequency (or the largest percentage). Let’s find it for the premarsx variable.


freq(gss24$premarsx, report.nas = FALSE) 
Frequencies  
gss24$premarsx  
Type: Factor  

                             Freq        %   % Cum.
-------------------------- ------ -------- --------
              always wrong    357    16.88    16.88
       almost always wrong    122     5.77    22.65
      wrong only sometimes    258    12.20    34.85
          not wrong at all   1378    65.15   100.00
                     Total   2115   100.00   100.00
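You can also ask R to pick out the mode for you. A minimal base R sketch, using the counts from the table above (typed in by hand here so the example runs on its own):

```r
# Counts copied from the freq() output above
counts <- c(
  "always wrong" = 357,
  "almost always wrong" = 122,
  "wrong only sometimes" = 258,
  "not wrong at all" = 1378
)

# which.max() finds the position of the largest count; names() gives its label
mode_category <- names(which.max(counts))
mode_category
```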

Median

We can use the same table we generated before to identify the median. This time, let’s use dplyr grammar to produce the same table.

Remember, use the cumulative percentage to locate the 50th percentile.

# Use dplyr grammar, starting with the name of the df and a pipe
gss24 |>
  # Use the freq() function as usual
  freq(premarsx, report.nas = FALSE) |>
  # Add the tb() function to turn the table into a tibble
  tb()
# A tibble: 4 × 4
  premarsx              freq   pct pct_cum
  <fct>                <dbl> <dbl>   <dbl>
1 always wrong           357 16.9     16.9
2 almost always wrong    122  5.77    22.6
3 wrong only sometimes   258 12.2     34.8
4 not wrong at all      1378 65.2    100  
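The rule "find the first category whose cumulative percentage reaches 50" can be written out in base R. This sketch uses the counts from the table above, typed in by hand, and assumes the categories are already in their natural order.

```r
# Counts copied from the table above, in category order
counts <- c(
  "always wrong" = 357,
  "almost always wrong" = 122,
  "wrong only sometimes" = 258,
  "not wrong at all" = 1378
)

# Cumulative percentage for each category
cum_pct <- 100 * cumsum(counts) / sum(counts)

# The median category is the first one where the cumulative % reaches 50
median_category <- names(counts)[which(cum_pct >= 50)[1]]
median_category
```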

Mean

freq(gss24$hrs1, report.nas = FALSE, headings = FALSE) 
Tagged NA values were detected and will be reported as regular NA; use haven::as_factor() to treat them as valid values

                       Freq         %    % Cum.
-------------------- ------ --------- ---------
      89+ hours [89]      6     0.339     0.339
                 [0]     10     0.566     0.905
                 [2]      3     0.170     1.075
                 [3]      4     0.226     1.301
                 [4]      7     0.396     1.697
                 [5]      9     0.509     2.206
                 [6]     11     0.622     2.828
                 [7]      2     0.113     2.941
                 [8]     15     0.848     3.790
                 [9]      4     0.226     4.016
                [10]     19     1.075     5.090
                [12]     12     0.679     5.769
                [13]      3     0.170     5.939
                [14]      3     0.170     6.109
                [15]     18     1.018     7.127
                [16]     10     0.566     7.692
                [17]      1     0.057     7.749
                [18]      4     0.226     7.975
                [19]      2     0.113     8.088
                [20]     58     3.281    11.369
                [21]      3     0.170    11.538
                [22]      5     0.283    11.821
                [23]      4     0.226    12.048
                [24]     21     1.188    13.235
                [25]     44     2.489    15.724
                [26]      5     0.283    16.007
                [27]      4     0.226    16.233
                [28]     10     0.566    16.799
                [29]      1     0.057    16.855
                [30]     68     3.846    20.701
                [31]      3     0.170    20.871
                [32]     32     1.810    22.681
                [33]      3     0.170    22.851
                [34]      8     0.452    23.303
                [35]     46     2.602    25.905
                [36]     29     1.640    27.545
                [37]     18     1.018    28.563
                [38]     17     0.962    29.525
                [39]      5     0.283    29.808
                [40]    697    39.423    69.231
                [41]     12     0.679    69.910
                [42]     23     1.301    71.210
                [43]     14     0.792    72.002
                [44]     15     0.848    72.851
                [45]     92     5.204    78.054
                [46]     15     0.848    78.903
                [47]      2     0.113    79.016
                [48]     21     1.188    80.204
                [49]      3     0.170    80.373
                [50]    143     8.088    88.462
                [51]      3     0.170    88.631
                [52]      7     0.396    89.027
                [53]      2     0.113    89.140
                [54]      5     0.283    89.423
                [55]     31     1.753    91.176
                [56]      5     0.283    91.459
                [58]      4     0.226    91.686
                [59]      2     0.113    91.799
                [60]     70     3.959    95.758
                [61]      1     0.057    95.814
                [62]      2     0.113    95.928
                [64]      2     0.113    96.041
                [65]     13     0.735    96.776
                [66]      1     0.057    96.833
                [67]      2     0.113    96.946
                [68]      2     0.113    97.059
                [69]      1     0.057    97.115
                [70]     22     1.244    98.360
                [72]      5     0.283    98.643
                [75]      3     0.170    98.812
                [77]      1     0.057    98.869
                [78]      1     0.057    98.925
                [80]     16     0.905    99.830
                [83]      1     0.057    99.887
                [84]      1     0.057    99.943
                [85]      1     0.057   100.000
               Total   1768   100.000   100.000

Question 1a. If working, full or part time: how many hours did you work last week, at all jobs?

Mean

# na.rm is a logical (TRUE or FALSE) indicating whether NA values
# should be stripped before the computation proceeds
mean(gss24$hrs1, na.rm = TRUE)
median(gss24$hrs1, na.rm = TRUE)
[1] 39.44005
[1] 40
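To see why na.rm matters, try the same functions on a made-up vector with one missing value (`hours` is hypothetical):

```r
# Hypothetical toy vector with one missing value
hours <- c(40, 35, NA, 45)

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(hours)

# With na.rm = TRUE, the NA is dropped before computing
mean_no_na   <- mean(hours, na.rm = TRUE)
median_no_na <- median(hours, na.rm = TRUE)

c(mean_with_na, mean_no_na, median_no_na)
```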

summary()


summary(gss24$hrs1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   35.00   40.00   39.44   45.00   89.00    1541 

Variability: descr()


descr(gss24$hrs1)
Descriptive Statistics  
gss24$hrs1  
Label: Number of hours worked last week  
N: 3309  

                       hrs1
----------------- ---------
             Mean     39.44
          Std.Dev     13.87
              Min      0.00
               Q1     35.00
           Median     40.00
               Q3     45.00
              Max     89.00
              MAD      7.41
              IQR     10.00
               CV      0.35
         Skewness     -0.05
      SE.Skewness      0.06
         Kurtosis      1.54
          N.Valid   1768.00
                N   3309.00
        Pct.Valid     53.43
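Several of the variability measures in this table can be reproduced with base R functions. The sketch below uses a made-up vector (`hours` is hypothetical), and computing CV as the standard deviation divided by the mean is an assumption that matches the numbers above (13.87 / 39.44 is about 0.35).

```r
# Hypothetical toy vector with one missing value
hours <- c(20, 40, 40, 45, 60, NA)

# Standard deviation: typical distance from the mean
sd_hours <- sd(hours, na.rm = TRUE)

# Interquartile range: Q3 minus Q1
iqr_hours <- IQR(hours, na.rm = TRUE)

# Median absolute deviation (scaled by 1.4826 by default)
mad_hours <- mad(hours, na.rm = TRUE)

# Coefficient of variation: spread relative to the mean
cv_hours <- sd_hours / mean(hours, na.rm = TRUE)

round(c(sd = sd_hours, iqr = iqr_hours, mad = mad_hours, cv = cv_hours), 2)
```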

Variability: descr()

gss24 |>
  descr(hrs1,
        # Which stats to produce: "all" (default), "fivenum", "common", or a selection; see ?descr
        stats = "common") |>
  tb()
# A tibble: 1 × 9
  variable  mean    sd   min   med   max n.valid     n pct.valid
  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>     <dbl>
1 hrs1      39.4  13.9     0    40    89    1768  3309      53.4